OVERVIEW:


Architecture of the floating-point unit for the BETA machine. 


Peak Performance of the unit. 
Other features of the Sprint chip. 


FUNCTIONAL DESCRIPTION: 
Major blocks of Sprint chip: 
Interface to memory and the WTL-3132 chip
Control that comes from Matrix board. 
Interface to the microcontroller. 
Transposers. 
Indirect addressing. 
The Rug. 
The Sprint chip error system and its uses. 
Instructions and Opcodes: 
Sprint Instructions and Opcodes. 
Instruction Sequence Definitions. 
CM chip to Sprint chip communications (bypasses memory). 
Major blocks of the WTL-3132:
Interface to the Sprint chip. 
Interface to the microcontroller. 
Internals of the WTL-3132. 
Microword Format: 
Microinstructions from the Microcontroller.


PROGRAMMING TOOLS AND EXAMPLES: 
Tools and Debugging aids: 
Pyvo and various other macros. 
Debugging aids and helpful hints. 
Programming Restrictions. 
Floating Point: 
Simple floating point instructions. 
Complicated floating-point instructions. 
IEEE exception handling with WTL-3132 chip. 
Transposing: 
Transposition of data by 90 degrees. 
Indirect addressing: 
Layout of arrays. 
Array references and array stores. 
OVERVIEW:
Architecture of the floating-point unit for the BETA machine: 


The architecture of the floating-point unit for the BETA machine
consists of the Sprint chip and a standard off-the-shelf
floating-point processor (WTL-3132). The Sprint chip acts as an
interface chip between the Connection Machine memory and the
floating-point processor chip. The Sprint chip and the
floating-point processor are shared by 2 Connection Machine chips,
or 32 processors.


Due to the bit-serial layout of the memory there is special
hardware on the Sprint chip that allows one to go from a
bit-serial to a word-parallel data arrangement. There are
3 or 4 of these special units, from now on called transposers.
Depending on the number of transposers on the chip, and given a
sufficient number of virtual processors, the maximum bandwidth can
be reached for many floating-point instructions (refer to
FP-ALGORITHMS for more precise details).


The floating-point chip (WTL-3132) is a state-of-the-art chip.
It contains a 32 bit floating-point multiplier, a 32 bit
floating-point ALU, a 32x32 bit register file and a 1024x8
look-up prom which can be used in the Newton-Raphson division
algorithm. Please refer to the Weitek WTL-3132 spec for more details.


Peak performance of the unit: 


The performance of a few simple instructions is listed here; for
more detailed information please refer to the FP-ALGORITHMS
spec.


***** Put performance numbers here...


Together, this combination of the Sprint chip and the WTL-3132
floating-point processor makes a full Connection Machine one of
the fastest machines.


Other features of the Sprint chip: 


Another major feature of each Sprint chip is the ability to
address its own memory. This means that each Sprint chip can
address different locations of the memory at the same time.
This capability increases the performance of array references
and array sets by an order of magnitude, since they can all be
done in parallel rather than serially as in the Alpha machine.
FUNCTIONAL DESCRIPTION:


MAJOR BLOCKS OF SPRINT CHIP: 


Interface to memory and the WTL-3132 chip:


[Block diagram, not recoverable from the scan. Labels that survive:
Instruction, Weitek static and dynamic field, SP OP, CM OP, two CM
Chips, CM memory, Maddr, address, Sprint and Weitek, with the numbers
10, 22, 32 and 44 marked on the connections.]


FLOATING-POINT UNIT IN A SYSTEM 


The Sprint chip is connected to the memory bus of 2 CM chips (MB bus), 
the address lines for those 2 chips (AB bus), and a WTL-3132 or 
WTL-3164 floating-point processor chip (FB bus and STATUS bus). 


The memory bus (MB) breaks down into memory-bus-0, memory-bus-1,
ecc-bus-0 and ecc-bus-1. Memory-bus-0 attaches to the memory
pins of the Connection Machine chip and the lower 16 bits of the
Sprint chip memory bus, and these also connect to some memory
chips. It should attach to a CM chip with a lower chip select
address than the chip that memory-bus-1 connects to. Ecc-bus-0
connects to the ecc pins of the same CM chip as memory-bus-0.
The same follows for ecc-bus-1.


The float bus (FB) connects to the X port on the WTL-3132 chip.
On the 3164 it should connect to the low 32 bits of the X port.


The Status bus connects to a few specific pins on the WTL-3132
or WTL-3164 chip. The Status bus consists of 4 status pins + 1
pin for the FPEX signal.


Fpex-Pin       ==> FPEX

Status-Pin<0>  ==> status-0 (on 3164)   FPCN   (on 3132)
Status-Pin<1>  ==> status-1 (on 3164)   FPZERO (on 3132)
Status-Pin<2>  ==> status-2 (on 3164)   NC     (on 3132)
Status-Pin<3>  ==> status-3 (on 3164)   NC     (on 3132)


The Address bus goes to the address lines of the section
memories. The pins are strong enough to directly drive the
memory chips. The enables on the chips handle the driving of
these pins. There are 10 such pins. RAS/CAS and ADDR-OE are
control lines for the address bus. The RAS/CAS can be
controlled just like the memory chips' RAS/CAS lines.
Interface to the microcontroller:


The Sprint chip interfaces to the microcontroller through the
instruction and opcode lines. The Instruction bus connects to
the same signals as the CM chip's instruction lines. The OP bus
connects to bits <5:3> of the OP bus. These opcodes are not
the same as the CM chip's. A more detailed explanation of
the opcodes and the instructions is given further on in this
document and in the Sprint chip specification.


The Sprint chip also drives the error pin and the global pin
back to the microcontroller. A more detailed explanation can
be found further on and also in the Sprint chip specification.
Transposers:


The Sprint chip contains 3 or 4 general purpose "transposers"
for reorienting data from serial to parallel and vice versa.
The variable that tells you how many transposers there are is
cmi::*sprint-number-of-transposers*. Transposing is the process
of taking 32 "serial" streams of 32 bits each and
putting them out as 32 "parallel" words of 32 bits each. The
number of bits is the same, but the order is different. This is
done by a special RAM on the Sprint chip that takes bits in one
order and "transposes" them into the other order. Whenever one
writes to a transposer, one sets some RAM in a COLUMN. Whenever
one reads a transposer, one reads a ROW. There is one address
pointer for each transposer. On a write the pointer points to
the column of the RAM and on a read it points to the row of
the RAM. The address pointer is 5 bits wide and one can do a few
simple operations on it: nop, post-add, post-clear. In the
standard mode of operation, loading 32 bits, the pointer will
be post-added and at the end it will wrap around to its
original value.


The transposer operation can be further explained with the
following example:


Assume we have a 3 by 3 transposer (rather than 32 by 32 as on
the real chip). Also the address pointer is 2 bits wide.


At every cycle we will get three bits from memory, those first 3 
bits represent a 3 bit slice across 3 processors. 


Then if one writes the following (A0 is a bit):

     2  1  0    <-- Column addresses

     A2 A1 A0
     B2 B1 B0
     C2 C1 C0
     |  |  |
     |  |  +--- Into column address 0 of the transposer
     |  +------ Into column address 1 of the transposer
     +--------- Into column address 2 of the transposer


On the write the address pointer is used as an address into a 
column of a transposer. After each write the pointer will be 
incremented by one and at the end of the operations it will wrap 
around to its original value. 


Now if one reads the data from this transposer we will get the 
following: 


Row addresses
  v
  0 | A2 A1 A0 <------+
  1 | B2 B1 B0 <---+  |
  2 | C2 C1 C0 <-+ |  |
                 | |  |
                 | |  +--- Reading row address 0 of the transposer
                 | +------ Reading row address 1 of the transposer
                 +-------- Reading row address 2 of the transposer


On the read the address pointer is used as an address into a row
of a transposer. After each read the pointer will be
incremented by one and at the end of the operations it will wrap
around to its original value. Therefore the data is transposed.
Notice that the data can be transposed back by repeating the
process. Thus conversion from serial to parallel and parallel
to serial is exactly the same process (it is its own inverse).
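
The column-write and row-read behaviour of the 3 by 3 example can be
modelled in a few lines of Python. This is only a sketch for
intuition, not a description of the hardware: the class name, the
method names and the list-of-lists storage are assumptions made for
illustration, and the pointer simply wraps modulo the transposer size.

```python
class Transposer:
    """Toy N x N transposer: a write fills a column, a read returns a row."""

    def __init__(self, n=32):
        self.n = n
        self.ram = [[0] * n for _ in range(n)]  # ram[row][column]
        self.pointer = 0  # wraps modulo n (5 bits wide when n is 32)

    def write(self, bits, post_add=True):
        """Store one bit per row into the column addressed by the pointer."""
        for row, bit in enumerate(bits):
            self.ram[row][self.pointer] = bit
        if post_add:
            self.pointer = (self.pointer + 1) % self.n

    def read(self, post_add=True):
        """Return the row addressed by the pointer, indexed by column address."""
        word = list(self.ram[self.pointer])
        if post_add:
            self.pointer = (self.pointer + 1) % self.n
        return word


# Cycle k delivers bit k of processors A, B and C.
t = Transposer(3)
for k in range(3):
    t.write([f"A{k}", f"B{k}", f"C{k}"])
rows = [t.read() for _ in range(3)]  # one whole word per processor
```

Feeding `rows` into a second `Transposer(3)` and reading it out again
returns the original bit slices, which is the "own inverse" property
described above.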


The reason for doing this will be apparent in the rest of this 
document. 
Indirect addressing:


The indirect addressing capability in the Sprint chip allows
each processor to address a different memory location. This
will be embodied in a set of move commands at the Paris
(Rel-3) level.


The basic approach to all of these commands is to load up the
set of addresses that are desired by the different processors
into one of the transposers on the Sprint chip. Then every
Sprint chip can command the memory chips' address lines by
sourcing the values out of that transposer. If the memory is
being read then the Sprint chip can record the data into another
transposer. After the chip has finished its data collection the
data is then put back into memory.


There are a couple of registers on the chip that help protect
against bashing of the VPs on array sets. These registers are in the rug.


The detailed explanation of the instructions and all the 
dependencies are discussed further on in this document. 
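
As a rough picture of what an indirect read achieves, the following
Python sketch treats each Sprint chip as owning its own memory and its
own address. It is a simplified model under stated assumptions: the
function name and the list-based memories are hypothetical, and the
transposer traffic, timing and bounds checking are left out.

```python
def indirect_read(memories, addresses):
    """Return, for each chip, the word found at that chip's own address.

    memories[i]  -- the memory attached to Sprint chip i (a list of words)
    addresses[i] -- the location chip i wants, as sourced from a transposer
    """
    collected = []  # stands in for the transposer that records the data
    for memory, address in zip(memories, addresses):
        collected.append(memory[address])
    return collected


# Two chips reading different locations in the same step.
result = indirect_read([[10, 11, 12], [20, 21, 22]], [2, 0])
```

In this model every chip performs its lookup in the same step, which is
the source of the speed-up over a serial array reference.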
The RUG:


Bits in the rug set up the chip modes, error status and address
pointers. The variable that describes the rug bits is
cmi::*sprint-rug-bits*. Reading and writing the rug is quite
common. The rug can only be written from or read into the BYPASS
register (a register on the chip); the bypass register in effect
can be written to and from memory. It takes two cycles to load
a rug register, but one can create a pipeline through the BYPASS
register and it will then take n+1 cycles to load n rug registers.
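
The cycle counts quoted above can be written down as a small Python
helper. This is bookkeeping only, assuming the two figures given in
the text (two cycles for a single load, n+1 cycles when the loads are
pipelined through the bypass register); the function name is
hypothetical.

```python
def rug_load_cycles(n, pipelined=True):
    """Cycles needed to load n rug registers from memory."""
    if n == 0:
        return 0
    # Pipelined: one fill cycle for the bypass register, then one per register.
    # Unpipelined: two cycles for every register.
    return n + 1 if pipelined else 2 * n
```

For example, loading 4 rug registers costs 5 cycles when pipelined
and 8 cycles otherwise.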


A whole group of rug accessors are supported in the 
interpret-rug file in the defs system. The rug accessors 
differentiate between BETA chip rug registers and Sprint chip 
Rug registers. 


Refer to instruction sequence definitions for more detailed 
explanation of the operations on the rug registers. 
The Sprint chip error system and its uses:


The error system records in the rug what errors occur. 
Depending on the state of the rug some of these errors will be 
signalled with the error pin. This is very much like the Beta 
chip error system (see the beta chip spec). 


Errors can occur from 3 sources:

   1  Memory ECC errors
   2  VP violation/Indirect Addressing bounds violation
   3  Activated Floating Point Exceptions


Errors are "signalled" by different pieces of the chip. When an
error is signalled, it is recorded in the rug. Some signalled
errors are "reported" or "strobed" onto the error pin based on
mask bits in the rug. If one of these errors happens, then the
corresponding rug bit is set. These rug bits can
only be reset by writing to the rug.


In order to "report" any of these errors the rug bit
:sprint-error-enable should be enabled first. Then the error
pin is strobed if one of these errors occurs and it is not masked
by the rug bit corresponding to the individual error that was
signalled.


The error pin is strobed on the cycle following the error. The
error pin is not "sticky", in that it is only asserted for the one
cycle after an error occurs. If another error occurs, then the
error pin is strobed again.


If the error-enable or mask bits in the rug are changed, the
effect on the error pin is undefined on that rug cycle. This
means that if an error occurs on the cycle before the rug write
cycle, and the rug write turns off the error-enable, that error
may get to the pin anyway. From the cycle after the rug write
onward, errors should be reported correctly; that is, the
new error-enables (and masks) take effect after the rug
cycle that changed them.
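
The signal/record/report distinction can be summarised with a small
Python model. It is a sketch under assumptions: the class, the error
names used as set members and the per-error mask set are all
hypothetical stand-ins for rug bits, and the undefined behaviour on
the rug write cycle itself is not modelled.

```python
class ErrorSystem:
    """Toy model of rug error recording and error-pin reporting."""

    def __init__(self):
        self.error_enable = False  # stands in for :sprint-error-enable
        self.masked = set()        # errors whose individual mask bit hides them
        self.recorded = set()      # sticky rug bits, one per error source

    def signal(self, error):
        """Record an error; return True if the pin is strobed next cycle."""
        self.recorded.add(error)   # always latched in the rug
        return self.error_enable and error not in self.masked

    def clear(self, error):
        """Rug bits are only reset by an explicit write to the rug."""
        self.recorded.discard(error)
```

A signalled error is therefore always visible in `recorded`, while the
return value of `signal` tells whether it also reaches the error pin.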


Memory ECC Errors: 


The chip can detect and correct single bit errors, and detects
but does not correct double bit errors. There are various bits in
the rug that inform one of this fact.


Activated Floating Point Exceptions Detection:


For floating-point operation exceptions the FPEX pin is brought
in from the floating-point processor onto the Sprint chip on a
separate pin.


There are several different modes for erroring based on the
state of this pin as decoded from the rug field
:signal-float-exception:


When :signal-float-exception = 0

   NOP: A float-exception error is not signalled under any
   circumstances based on this pin. This means that the
   error bit in the rug is not latched, and the error pin
   is not strobed.


When :signal-float-exception = 1

   Signal if asserted:
   If float-exception [FPEX pin] is asserted AND the
   instruction bit status-bus-write-enable is active,
   then an error is signalled (latched into the rug).

   Whether the error is "reported" or "strobed" depends
   on rug bit :error-on-float-exception.


Vp-errors:


On any ISOP cycle the :vp-error rug register of the selected
chip is set to the logical OR of itself and the adder-error and
comparator-error signals. This register is sticky, so it must be
explicitly cleared if an error occurs. If the rug register
:error-on-vp-error is asserted, then the contents of the
:vp-error register will be reflected in the chip error signal.


Global:


The Sprint chip can drive the Global line back to the
microcontroller. The Global can be enabled or disabled by the
:global-enable signal in the rug. If the global is enabled then
it will be active on two opcodes, SOP and ISOP, iff the
:float-exception rug bit is set or the :vp-error rug bit is active.
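
Both rules above are plain boolean expressions, written out here in
Python as a sketch. The function and parameter names are hypothetical;
they mirror the rug bits and signals named in the text rather than any
real interface.

```python
def update_vp_error(vp_error, adder_error, comparator_error):
    """Sticky update applied to :vp-error on an ISOP cycle of the selected chip."""
    return vp_error or adder_error or comparator_error


def global_asserted(opcode, global_enable, float_exception, vp_error):
    """True when the Sprint chip drives the Global line for this opcode."""
    return (global_enable
            and opcode in ("SOP", "ISOP")
            and (float_exception or vp_error))
```

Because the update ORs the old value back in, a recorded VP error stays
set on later cycles until the rug register is cleared explicitly.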
INSTRUCTIONS AND OPCODES:
Sprint Instructions and Opcodes: 


This section describes the various instructions, their formats, 
and what they do. 


op-name :  NOP   SOP            LBL     LBH     RUG       ISOP
OP      :  0     1              2       3       4         5

bit
 0         DC    FB-R-W         d<0>    d<0>    R-W       EVEN-ODD
 1         DC    SB-WE          d<1>    d<1>    Addr<0>
 2         DC    SB-PC<0>       d<2>    d<2>    Addr<1>
 3         DC    SB-PC<1>       d<3>    d<3>    Addr<2>
 4         DC    FB-DEVICE<0>   d<4>    d<4>    Addr<3>   FB-DEVICE<0>
 5         DC    FB-DEVICE<1>   d<5>    d<5>    MB-E      FB-DEVICE<1>
 6         DC    FB-DEVICE<2>   d<6>    d<6>    DC        FB-DEVICE<2>
 7         DC    FB-PC<0>       d<7>    d<7>    DC        FB-PC<0>
 8         DC    FB-PC<1>       d<8>    d<8>    DC        FB-PC<1>
 9         DC    DC             d<9>    d<9>    DC        DC
10         DC    MB-R-W         d<10>   d<10>   DC        MB-R-W
11         DC    MB-PC<0>       d<11>   d<11>   DC        MB-PC<0>
12         DC    MB-PC<1>       d<12>   d<12>   DC        MB-PC<1>
13         DC    MB-DEVICE<0>   d<13>   d<13>   DC        MB-DEVICE<0>
14         DC    MB-DEVICE<1>   d<14>   d<14>   DC        MB-DEVICE<1>
15         DC    MB-DEVICE<2>   d<15>   d<15>   DC        MB-DEVICE<2>

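
To make the SOP column of the table concrete, the following Python
sketch unpacks a 16-bit instruction word into its fields. It assumes
that the row number is the bit position within the word and that
<0> is the least significant bit of each multi-bit field; the
dictionary and function names are hypothetical.

```python
# Field name -> (lowest bit position, width), read off the SOP column.
SOP_FIELDS = {
    "FB-R-W": (0, 1),
    "SB-WE": (1, 1),
    "SB-PC": (2, 2),
    "FB-DEVICE": (4, 3),
    "FB-PC": (7, 2),
    "MB-R-W": (10, 1),
    "MB-PC": (11, 2),
    "MB-DEVICE": (13, 3),
}


def decode_sop(word):
    """Split a 16-bit SOP instruction word into named fields (bit 9 is DC)."""
    return {
        name: (word >> low) & ((1 << width) - 1)
        for name, (low, width) in SOP_FIELDS.items()
    }


# MB-R-W set, MB-DEVICE = 5, FB-PC = 2, everything else zero.
fields = decode_sop((1 << 10) | (0b101 << 13) | (0b10 << 7))
```
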
NOP: 


Nop does nothing to the state of the chip. No errors are detected or 

signalled. All external busses are undriven. 

MB is not driven 

FB is not driven 

AB is driven based on ADDR-OE. The current matrix board will not let 
this happen. 

Global is not asserted. 


SOP:


This is the normal instruction during floating point.

This instruction controls 3 busses of the chip: the memory,
float, and status busses. This instruction also controls the
reading and writing of the devices on the chip (see device
section). The status-bus-write-enable controls the writing of
the status transposer. The status-bus-pointer-control can affect
the pointer associated with the status transposer. The effect
on the memory bus is controlled by the memory-bus-device,
memory-bus-pointer-control, and memory-bus-read-not-write. If
the memory-bus-read-not-write is set (read) then a device,
specified by memory-bus-device, will be "read" and the memory
bus will be driven with its contents. If the
memory-bus-read-not-write is not set (write) then the
memory-bus-device will be written with the data on the
memory-bus (from memory). The pointer associated with the
memory-bus-device will be changed based on
memory-bus-pointer-control. If the device has no associated
pointer (this includes the condition register, bypass register,
and sink-pointer), then the memory-bus-pointer-control field is
ignored. The float bus has the same set of fields with the same
control as the memory bus. Other control which applies to SOP
instructions can be found in the rug. Those bits are described
in the section where they are used.


MB might be driven (based on MB-R-W), 


FB might be driven (based on FB-R-W), 
Global is asserted based on Error, 


LBL (Load Bypass Low Immediate):


This will load the low 2 bytes of the bypass register with the
data in the instruction. This is usually done for changing the
rug, or for loading constants into the Weitek chip.


MB is not driven 
FB is not driven 
Global is not asserted. 


LBH (Load Bypass High Immediate): 


This will load the high 2 bytes of the bypass register with the
data in the instruction. This is usually done for changing the
rug, or for loading constants into the chip.


MB is not driven 
FB is not driven 
Global is not asserted. 


RUG: 


This handles the writing and reading of the rug register to and
from the bypass register. If the read-not-write bit is active
then the rug is read into the bypass register. Conversely, if
it is inactive then the rug is written from the bypass register.
If MB-E is active and the rug-read bit is inactive then the contents
of the bypass register from the last cycle are stored in the
specified rug register and the memory bus is written to the
bypass register. On the other hand, if MB-E and rug-read are
active then the memory-bus is written from the bypass register,
and the bypass is written with the value from the rug.


MB is driven based on MB-E and read-write
FB is not driven
Global is not asserted.


ISOP (Indirect Sprint Operation): 


This is a variation on the SOP instruction used for indirect addressing 
(see Indirect Addressing Section). Reference the SOP instruction 
documentation in this section for all aspects not specified directly below. 


If the EVEN-ODD bit of the instruction matches the :EVEN-ODD status bit
in the rug, then the Sprint chip is said to be selected for this
operation. The execution of an ISOP instruction is conditional on
whether the chip is selected. (Note: In normal operation, the :EVEN-ODD
bits will have been configured so that of the two Sprint chips in
each section one will be "even" and the other "odd".)


If the address bus is enabled, by the address-oe pin, then it is
driven with the sum of the vp-base register (in the rug) and the contents of
the vp-offset bus from 2 cycles before.


This opcode enables the VP-error mechanism. A VP-error is signalled if 
the chip is selected AND there is an error based on the address 
currently on the address bus. 


The status bus pointer and the status transposer are never affected by an ISOP. 


If an MB-read is specified, it is executed normally (as if for an SOP
instruction) if the chip is selected. If the chip is not selected, then
the bypass register is read onto the memory bus no matter what MB-device
was specified, and the pointer-control field has no effect.


If an MB-write is specified, it is executed normally (as if for an SOP
instruction) if the chip is selected. If the chip is not selected, then
data is not written to the specified device, or to any device, and the
pointer-control field has no effect.


A read operation is always performed on the specified FB-device, but the
Float bus itself is not driven. This is done so that the vp-offset bus
will be supplied with new data.


MB might be driven (based on MB-R-W),
FB is not driven,
AB is driven based on ADDR-OE,
Global is asserted based on Error (see global section).
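
The selection rule and its effect on the memory bus side of an ISOP
can be condensed into the following Python sketch. It models only the
cases listed above; the function names and the dictionary keys are
hypothetical, and the address computation and error checks are not
included.

```python
def isop_selected(instruction_even_odd, rug_even_odd):
    """A chip is selected when the instruction bit matches its rug bit."""
    return instruction_even_odd == rug_even_odd


def isop_mb_action(selected, mb_read):
    """Describe what the memory bus half of an ISOP does on one chip."""
    if mb_read:
        # An unselected chip still drives MB, but from the bypass register.
        source = "device" if selected else "bypass-register"
        return {"mb_driven_from": source,
                "device_written": False,
                "pointer_control_applies": selected}
    # MB-write: only a selected chip writes a device or moves a pointer.
    return {"mb_driven_from": None,
            "device_written": selected,
            "pointer_control_applies": selected}
```
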
Instruction Sequence Definitions:


This section describes some instruction sequences as they relate
to the Sprint chip. All the dependencies for the instructions
are covered here.


Loading of a TRANSPOSER:


A simple one-cycle load of a transposer-a column from some address
specified by some register SOURCE would look like the following:


(ui (sop-maddr SOURCE)
    (sop :memory-bus :transposer-a
         :memory-bus-pointer-control :nop
         :memory-bus-direction :write))


In the above cycle 32 bits from memory location SOURCE are
loaded into the column of transposer-a pointed to by the pointer of that
transposer. We assume here that the pointer was pointing to the
first column. As you can see, at the end of the instruction
neither the pointer of the transposer nor the address has been
incremented, hence if this cycle is repeated the same data will
be loaded into the same column of transposer-a. Now let's create
a loop that will allow us to load the whole of transposer-a. This
loop will consist of 32 SOP cycles and a couple of cycles of
overhead in the repeat loop.


(repeat (i (constant 32.))
  (ui (sop-maddr (SOURCE++))
      (sop :memory-bus :transposer-a
           :memory-bus-pointer-control :post-add
           :memory-bus-direction :write)))


As one can see these 32 SOP cycles load transposer-a from the
memory addresses supplied by register SOURCE. Since the
transposer pointers are 5 bits wide, after 32 cycles they
will have wrapped back to zero.


Transposing data: 


Now that we know how to load data into a transposer let's try to
transpose data in memory. For this operation we will have to do
two loops, where in one we will load transposer-a and in the
other we will unload transposer-a back to memory.


First load the data from SOURCE into transposer-a:

(def-min-mic transpose-always (destination source)

  ;; Load data into transposer-a from SOURCE
  (repeat (i (constant 32.))
    (ui (sop-maddr (SOURCE++))
        (sop :memory-bus :transposer-a
             :memory-bus-pointer-control :post-add
             :memory-bus-direction :write)))

  ;; Wait one cycle between load and unload, this
  ;; is due to the timing constraint on Sprint chip
  (ui)

  ;; Unload data into DESTINATION from transposer-a
  (repeat (i (constant 32.))
    (ui (sop-maddr (DESTINATION++))
        (sop :memory-bus :transposer-a
             :memory-bus-pointer-control :post-add
             :memory-bus-direction :read)
        (we-cntl hi-lo))))


As one can see, in the first 32 cycles the data is loaded into
transposer-a by the first loop. Then we have to wait one cycle;
this one-cycle wait is necessary due to timing constraints of the
Sprint chip. There is a whole list of timing constraints in the
Sprint chip specification. In the second loop we unload data from
transposer-a into memory pointed to by register DESTINATION. The
(we-cntl hi-lo) enables us to latch data into the memory of the
processors.


So far we have only covered access to the devices from the memory
bus of the Sprint chip, but there exists a second bus, the float
bus, in the chip. The control of this bus is totally symmetrical to
that of the memory bus. More on this bus will be presented after we
have covered WTL-3132 instructions and interfaces.

Let's go on and see what other operations there are in the Sprint
chip.


Loading of constants into Sprint chip: 


The following operations allow one to load constants into the Sprint
chip; the constant comes over the instruction lines.


This instruction will load the constant 100 into the lower 16 bits
of the bypass-register on the Sprint chip.


(ui (lbl :sprint-lbl-data (constant 100.)))


This instruction will load the constant 100 into the upper 16 bits
of the bypass-register on the Sprint chip.


(ui (lbh :sprint-lbh-data (constant 100.)))
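
The effect of the two immediates on the 32-bit bypass register can be
pictured with this Python sketch. It assumes that each instruction
replaces only its own 16-bit half and leaves the other half unchanged,
which the text implies but does not state outright; the class and
method names are hypothetical.

```python
class BypassRegister:
    """Toy 32-bit bypass register loaded in two 16-bit halves."""

    def __init__(self):
        self.value = 0

    def lbl(self, data):
        """Load the low 2 bytes from the 16-bit immediate."""
        self.value = (self.value & 0xFFFF0000) | (data & 0xFFFF)

    def lbh(self, data):
        """Load the high 2 bytes from the 16-bit immediate."""
        self.value = (self.value & 0x0000FFFF) | ((data & 0xFFFF) << 16)


bypass = BypassRegister()
bypass.lbl(100)  # low half  = 0x0064
bypass.lbh(100)  # high half = 0x0064
```

After both loads the register holds 0x00640064 in this model, so a full
32-bit constant takes two instruction cycles to assemble.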


Using the RUG: 


Bits in the rug set up the chip modes and contain error status.
Reading and writing the rug is quite common.


The rug registers are described by a list in the Sprint spec,
which is bound to cmi::*sprint-rug-bits*.


On the Sprint chip, access to the rug is only through the
bypass-register. The following sequences show how to read and
write the rug.


Let's say we want to clear all of the pointers in the rug. The
register in the rug that we want to access is called :pointer.
Hence, first we will load the bypass-register with all zeros and
then execute a write rug instruction.


(ui (lbl :sprint-lbl-data 0)) ;; Load lower 16 bits with zero
(ui (lbh :sprint-lbh-data 0)) ;; Load upper 16 bits with zero
(ui (sprint-rug :register :pointer :rug-read-write :write)) ;; Write bypass into rug


As you can see the whole sequence takes 3 cycles.

Another useful sequence is writing the rug to memory. Let's say we
want to save the :configuration register into memory. This is how
one can accomplish this:


(ui (sprint-rug :register :configuration :rug-read-write :read))
(ui (sop :memory-bus :bypass-register :memory-bus-direction :read)
    (sop-maddr temp-configuration))


In the first cycle the rug register :configuration is read out
into the bypass-register, and in the second cycle the
bypass-register is read into memory at the location specified by
temp-configuration.


These are just a couple of examples of how to access the rug. Many
different rug accessors are supported in the interpret-rug file
in the defs system.
CM chip to/from Sprint chip communications:


***** This section needs to be defined.


Major blocks of the WTL-3132 floating-point unit:

Interface to the Sprint chip:


The Weitek floating point chip is a 32 bit IEEE format floating 
point accelerator. 

It has one bidirectional data port which is connected 

to the Sprint chip float bus, the width of that bus is 32 bits. 
This bus together with Sprint chip allows one to transfer data 
to and from the WTL-3132 register file. The standard way of 
operating on data is to first load it into Sprint chip, then 
transfer the data into WTL-3132 for operations and then transfer 
it back into Sprint chip and back to memory. 


It has thirty-two registers which can be used as operands
or destinations of operations, and 3 temporary registers which
can be used as destinations and as sources of the third operand in
three operand instructions.

It can do these unary operations: absolute value, float to 24
bit integer, 24 bit integer to float and a lookup for an
initial seed to calculate the inverse of a number. The binary
operations are add, subtract, negate and add. The three operand
instructions are multiply and add, multiply negate and add, and
multiply negate and subtract.
Interface from the microcontroller:


The WTL-3132 gets its instructions from the microcontroller. Due to
the fact that the floating point processor instruction word 
length is too large to be fitted onto current hardware 
implementation, the instruction is broken into two halves. 

Part of the instruction is called "Dynamic" and the other part 
is "Static". The idea here is that the "dynamic" instruction 
can be changed on every cycle of the operation, but "static" can 
not. 


The "static" instruction contains the following information in it: 


(weitek-static-instruction :wtl3132-function function
                           :c-port enable or disable)


It has the function specifier and the enable for the c-port
write enable on the chip. The WTL-3132 specification has more
information on the function specifier and on c-port enable.


The "dynamic" instruction contains the following information in it:


(weitek-dynamic-instruction :a-addr register
                            :b-addr register
                            :c-addr register
                            :io-direction load, store or nop
                            :mult-b-input bbus or cbus
                            :alu-b-input zero, bbus or temps
                            :alu-destination cbus or temps)


This instruction effectively controls the whole WTL-3132 chip.
At this point you should refer to the WTL-3132 specification for
more exact details.
Internals of the WTL-3132:


Now that you have read the WTL-3132 specification we can continue
with this discussion.


In the course of normal operations the first instruction that
gets broadcast to the chip is "static", with various fields set
appropriately for the operation. Let's assume we want to load the
chip's register 0 and register 1 with data from transposer-a and
then operate on them (the operation will be an add).


First we broadcast a "static" instruction. This will set the
function field and the c-port disable of the register file.


(ui (weitek-static-instruction :c-port :disable)) 


This instruction is latched onto the matrix board static latch
and at the same time the WTL-3132 stall signal is asserted. The
stall signal neutralizes the current instruction. This is
necessary since all we want to do is to preset the static
fields, but not to execute this instruction (we do not know
what's in the dynamic field).


The next instruction that we want to broadcast is "dynamic" 
instruction, where we will specify the method of loading the 
chips register file and which register. Upon broadcasting this 
instruction it will be latched into dynamic latch on the matrix 
board and the instruction execution will start. 


Since we wanted to load register @ and register 1 the following 
instructions will achieve that: 


;; Load register-0 with value from transposer-a
(ui (weitek-dynamic-instruction :c-addr 0 :io-direction :load)
    (sop :float-bus :transposer-a
         :float-bus-direction :read
         :float-bus-pointer-control :post-add))


;; Load register-1 with value from transposer-a
(ui (weitek-dynamic-instruction :c-addr 1 :io-direction :load)
    (sop :float-bus :transposer-a
         :float-bus-direction :read
         :float-bus-pointer-control :post-add))


;; Reset io-direction field
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))


As we can see the first two cycles will load the data, but why
do we need that third cycle?


The reason for the last cycle is the following: on the current
WTL-3132 the stall signal does not neutralize the instruction
if the io-direction field contains anything but a nop. This
means that if the next instruction after the last load were a
static instruction, then that instruction would not be stalled;
it would get executed and bash the register specified in the
c-addr field (remember, the dynamic field is latched on the board!).


So now we want to operate on these two registers. Let's enable
the c-port field and set the function field appropriately.
Again we have to do a "static" instruction.


(ui (weitek-static-instruction :c-port :enable :wtl3132-function :float-add))


Now that we latched the function, specify the registers through 
the "dynamic" instruction. We will operate on registers @ and 1 
and put the result into 2. 


(ui (weitek-dynamic-instruction :a-addr 0 :b-addr 1 :c-addr 2
                                :alu-b-input :bbus
                                :alu-destination :cbus))


Now three cycles later the instruction will finish, so during
these three cycles we want to execute nops on the chip.

(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))


Three cycles later the data has been written into register 2. Now
we execute a store instruction and latch this data into the Sprint
chip. Again, since the io-direction field is in the "dynamic"
instruction, we execute dynamic instructions and specify the
store in that field. The store cycle is pipelined on the WTL-3132
by one cycle, so the following sequence gets the data out of
register 2 and into the Sprint chip.


;; Start of store
(ui (weitek-dynamic-instruction :c-addr 2 :io-direction :store))


;; Cycle the pipeline
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))


;; Now latch the data into sprint chip
;; Again cycle the pipeline
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop)
    (sop :float-bus :transposer-c
         :float-bus-direction :read
         :float-bus-pointer-control :post-add))


;; Now reset "static" fields
(ui (weitek-static-instruction :c-port :disable))


Let's combine all of the above into one sequence.


;; Disable c-port on the weitek chip
(ui (weitek-static-instruction :c-port :disable))


;; Load register-0 with value from transposer-a
(ui (weitek-dynamic-instruction :c-addr 0 :io-direction :load)
    (sop :float-bus :transposer-a
         :float-bus-direction :read
         :float-bus-pointer-control :post-add))


;; Load register-1 with value from transposer-a
(ui (weitek-dynamic-instruction :c-addr 1 :io-direction :load)
    (sop :float-bus :transposer-a
         :float-bus-direction :read
         :float-bus-pointer-control :post-add))


;; Reset io-direction field
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))


;; Enable c-port field and set up the function field
(ui (weitek-static-instruction :c-port :enable :wtl3132-function :float-add))


;; EXECUTE THE ADD INSTRUCTION HERE
(ui (weitek-dynamic-instruction :a-addr 0 :b-addr 1 :c-addr 2
                                :alu-b-input :bbus :alu-destination :cbus))


;; WAIT FOR THE DATA
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))


;; Start of store
(ui (weitek-dynamic-instruction :c-addr 2 :io-direction :store))


;; Cycle the pipeline
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop))


;; Now latch the data into sprint chip
;; Again cycle the pipeline
(ui (weitek-dynamic-instruction :c-addr 31 :io-direction :nop)
    (sop :float-bus :transposer-c
         :float-bus-direction :read
         :float-bus-pointer-control :post-add))


;; Now reset "static" fields
(ui (weitek-static-instruction :c-port :disable))


If you can understand the above, then you can write microcode 
for the weitek and sprint chip. 
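
To make the cycle counting explicit, here is a small Python sketch
that lists the microinstructions of the load, add and store
walkthrough in order and numbers them. It is only a bookkeeping aid:
the step labels and the function name are illustrative, and it
assumes one microinstruction per cycle. The two latencies (three nop
cycles after the add, one pipeline cycle after the store) are taken
from the walkthrough.

```python
ADD_LATENCY = 3     # nop cycles after the add, per the walkthrough
STORE_PIPELINE = 1  # extra pipeline cycle after the store


def add_sequence_steps():
    """List the microinstructions of the two-load add-and-store sequence."""
    steps = [
        "static: disable c-port",
        "dynamic: load register 0",
        "dynamic: load register 1",
        "dynamic: reset io-direction",
        "static: enable c-port, float-add",
        "dynamic: add 0 + 1 -> 2",
    ]
    steps += ["dynamic: nop (wait for add)"] * ADD_LATENCY
    steps += ["dynamic: store register 2"]
    steps += ["dynamic: nop (cycle pipeline)"] * STORE_PIPELINE
    steps += [
        "dynamic: nop, latch into transposer-c",
        "static: disable c-port",
    ]
    return steps


steps = add_sequence_steps()
for cycle, step in enumerate(steps, start=1):
    print(cycle, step)
```

Under these assumptions the sequence takes 13 microinstructions, and
the store is issued four cycles after the add.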
Microword Format:
Microinstructions from the Microcontroller:


This is the layout of the static and dynamic fields as seen by
the microcontroller.


Static Fields (total 14 bits), used 5 bits.
WTL-3132:
| NC | CWEN | F | NC |


Dynamic Fields (total 24 bits), used all.
WTL-3132:
| AADDR | BADDR | CADDR | MBIN- | ENCN | IO-COUNT | ADST0 | ADST1 | ABSEL |


The layout of the instruction lines to the Sprint chip is the
same as for the Beta chip.


PROGRAMMING TOOLS AND EXAMPLES:
Tools and Debugging aids:
PVO and various other macros:


As one can see, it is extremely difficult to program at the low
level of specifying every instruction for the Sprint and Weitek
chips. What one needs is an abstract model of the architecture.
The model that provides the most efficient use of the bandwidth
of the architecture is presented below. This model is accessed
through the PVO macro (PVO stands for pipelined macro
operations).


This model views the machine as having two buses, one called the
memory bus and the other the float bus. Each bus can perform the
following functions: nop, load and store. Each bus is connected
to some device at both ends. The memory bus is connected at one
end to the memory and at the other to one of four possible
devices (transposer-a, transposer-b, transposer-c and
status-transposer). The float bus is connected at one end to one
of the four devices and at the other end to the floating-point
processor register file, which is itself just another device.
The floating-point processor can perform one of four operations:
nop, load, store and operate.


All of the buses and the floating-point processor can be
controlled at the same time.


The use of this model is as follows; it follows the
FP-ALGORITHMS document very closely.


1. Memory bus: Load some device from memory bus (operand A)
   Float bus:  Idle
   Float Proc: Idle


2. Memory bus: Load some device from memory bus (operand B)
   Float bus:  Store device containing operand A onto float bus
   Float Proc: Load data from float bus into Register file.


3. Memory bus: Idle
   Float bus:  Store device containing operand B onto float bus
   Float Proc: Operate on data from float bus and Register
               file, put result back into Register file.


4. Memory bus: Idle
   Float bus:  Load device with data from float bus (result C)
   Float Proc: Store data from Register file onto float bus


5. Memory bus: Store data from device onto memory bus (result C)
   Float bus:  Idle
   Float Proc: Idle
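
The five steps can also be written down as a small table and
queried. The Python sketch below is only an illustration of the
model; the short action labels and the helper name busy_steps are
assumptions made for this example. It records, for each unit, the
steps in which that unit is not idle.

```python
UNITS = ("memory bus", "float bus", "float proc")

# One row per step 1..5: (memory bus, float bus, float proc).
SCHEDULE = [
    ("load A", "idle", "idle"),
    ("load B", "store A", "load"),
    ("idle", "store B", "operate"),
    ("idle", "load C", "store"),
    ("store C", "idle", "idle"),
]


def busy_steps(schedule):
    """Map each unit to the 1-based step numbers in which it is not idle."""
    busy = {unit: [] for unit in UNITS}
    for step, actions in enumerate(schedule, start=1):
        for unit, action in zip(UNITS, actions):
            if action != "idle":
                busy[unit].append(step)
    return busy


for unit, steps in busy_steps(SCHEDULE).items():
    print(unit, steps)
```

In this table the memory bus is busy in steps 1, 2 and 5, while the
float bus and the floating-point processor are busy in steps 2, 3
and 4. Each unit has idle steps, which suggests where successive
operations could be overlapped.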


The way one would write this using the actual PVO macro follows 
below: 


(def-min-mic simple-float-add-always (destination source)

  (floating-point-macro-instruction-start
    :wtl3132-function :float-add :condition :always)

  (pvo :doc "Load transposer-a with data from source"
       :maddr-pointer source
       :memory-bus :load
       :memory-bus-device :transposer-a)

  (pvo :doc "Load transposer-b with data from destination,
             unload source into Register file"
       :maddr-pointer destination
       :memory-bus :load
       :memory-bus-device :transposer-b
       :float-bus :store
       :float-bus-device :transposer-a
       :weitek-operation :load)

  (pvo :doc "Operate on transposer-b and Register file"
       :float-bus :store
       :float-bus-device :transposer-b
       :weitek-operation :operate
       :weitek-function :float-add)

  (pvo :doc "Unload data from Register file into transposer-c"
       :float-bus :load
       :float-bus-device :transposer-c
       :weitek-operation :store)

  (pvo :doc "Store data from transposer-c into memory"
       :maddr-pointer destination
       :memory-bus :store
       :memory-bus-device :transposer-c)

  )


The above operation performs a two-address floating-point
instruction, which adds destination and source unconditionally.
All of the pipelining details are handled by the PVO macro.

The floating-point-macro-instruction-start macro sets up the
pointer register in the rug, sets the "dynamic" field, and sets
the configuration register to operate properly for conditional or
unconditional operation.


This model describes the standard operations of the
floating-point processor and the Sprint chip. As one can see, it
follows the FP-ALGORITHMS document very closely.


There are various helpful macros written for the Sprint chip and
the Weitek. Most of them are located in the file
wtl3132-uc.lisp. Many more useful macros are being implemented
right now, and PVO is going through a major redesign. Once the
redesign of PVO is final, a pointer to its documentation file
will be placed here.


Debugging aids and helpful hints: 


The following accessors are useful as debugging aids: 


(Read-rug-sprint chip register)          - read rug register
(Write-rug-sprint maddr register)        - write rug register
(Write-weitek-to-memory-safely maddr)    - dump register file to memory
(Unsigned-read-slice-32 chip maddr)      - read a 32 bit slice
(Unsigned-write-slice-32-always maddr)   - write a 32 bit slice
(Float-read-slice-32 chip maddr)         - read a 32 bit float slice


All of the above operate on data slicewise in the machine, where
each bit of a 32-bit datum is located in a different one of the
32 processors connected to the Sprint chip.
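
As an illustration of the slicewise layout, the following Python
sketch models memory as a table that holds, for each memory address,
one bit per processor. It is not the implementation of the accessors
listed above. The function names, the dictionary used for memory,
and the choice that processor i holds bit i of the word (bit 0 being
least significant) are all assumptions made for the example.

```python
PROCESSORS = 32  # processors sharing one Sprint chip


def write_slice_32(memory, maddr, word):
    """Scatter a 32-bit word across the 32 processors at address maddr."""
    memory[maddr] = [(word >> processor) & 1 for processor in range(PROCESSORS)]


def read_slice_32(memory, maddr):
    """Assemble a 32-bit word from the bits the processors hold at maddr.

    Assumption: processor i holds bit i of the word.
    """
    word = 0
    for processor, bit in enumerate(memory[maddr]):
        word |= (bit & 1) << processor
    return word


memory = {}
write_slice_32(memory, 5, 0x3F800000)  # bit pattern of IEEE single 1.0
print(hex(read_slice_32(memory, 5)))
```

One memory address therefore yields a whole 32-bit word, one bit
from each processor.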

The following hints are useful:


Pointers getting out of sync with each other. Remember that
each transposer has its own pointer.

Forgetting to set up the configuration register properly.

Making mistakes in the pipeline of the Weitek chip.


These are some of the common errors that we have run across.
There are surely some that we forgot to mention; as knowledge of
common errors accumulates, they should be added to this space.


Programming Restrictions:


The following are the current restrictions that apply to the
Sprint chip:


The following chart shows the allowable SOP encodings. Those boxes
marked with an "X" are illegal.

[Chart: allowable SOP encodings]


The next chart shows which sequences of operations are disallowed
because of timing constraints. Again those marked with an "X" are
illegal.

[Chart: disallowed sequences of operations]


Also, after an MB-write condition register operation, the bits in
the condition register cannot be used to conditionalize the
operation on the next cycle. A no-op, or another operation that
doesn't depend on the condition register bits, must be inserted
on the next cycle.
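
A scheduling pass can enforce this rule automatically. The Python
sketch below is a minimal illustration under stated assumptions:
each operation is a dictionary, and the keys "writes-cond" and
"uses-cond", the operation names, and the function name
insert_condition_nops are all hypothetical. It inserts a no-op
whenever an operation that uses the condition bits would directly
follow a write of the condition register.

```python
def insert_condition_nops(ops):
    """Return a copy of ops with a no-op inserted wherever an op that uses
    the condition bits would directly follow a condition register write."""
    result = []
    for op in ops:
        previous = result[-1] if result else None
        if previous and previous.get("writes-cond") and op.get("uses-cond"):
            result.append({"name": "nop"})
        result.append(op)
    return result


ops = [
    {"name": "mb-write-cond", "writes-cond": True},
    {"name": "conditional-store", "uses-cond": True},
]
print([op["name"] for op in insert_condition_nops(ops)])
```

An operation that does not depend on the condition bits is left
where it is, since it already fills the required slot.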


The Weitek chip has the following restrictions:


D-ADDR is the same as C-ADDR.

The IO-DIRECTION field must be reset to NOP.

A dummy fix instruction must be inserted after the real fix
instruction.

A-addr = B-addr = C-addr with IO-direction = :NOP does not work.
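
Two of these restrictions lend themselves to a mechanical check.
The Python sketch below is a hedged illustration, not an existing
tool: instructions are modelled as dictionaries, and the key names,
the function name "fix", and the reading of "dummy fix" as a second
fix instruction immediately after the real one are assumptions made
for the example.

```python
def check_weitek_restrictions(instructions):
    """Return (index, message) pairs for violations of two restrictions."""
    problems = []

    # Restriction: a-addr = b-addr = c-addr with io-direction nop.
    for index, inst in enumerate(instructions):
        addrs = (inst.get("a-addr"), inst.get("b-addr"), inst.get("c-addr"))
        if (None not in addrs and len(set(addrs)) == 1
                and inst.get("io-direction", "nop") == "nop"):
            problems.append((index, "a-addr = b-addr = c-addr with nop"))

    # Restriction: a real fix must be followed by a dummy fix.
    index = 0
    while index < len(instructions):
        if instructions[index].get("function") == "fix":
            following = index + 1
            if (following < len(instructions)
                    and instructions[following].get("function") == "fix"):
                index = following  # treat the following fix as the dummy
            else:
                problems.append((index, "fix without a following dummy fix"))
        index += 1
    return problems


program = [
    {"function": "add", "a-addr": 3, "b-addr": 3, "c-addr": 3},
    {"function": "fix", "a-addr": 0, "b-addr": 1, "c-addr": 2},
    {"function": "add", "a-addr": 0, "b-addr": 1, "c-addr": 2},
]
for problem in check_weitek_restrictions(program):
    print(problem)
```

Instructions that leave a-addr or b-addr unspecified, such as the
nop cycles used in the microcode examples, are not flagged by the
first check.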


There is also a whole list of current Weitek errors; it is
located in the wtl3132-errors.lisp file.


Floating Point:
Simple floating point instructions.
Complicated floating-point instructions.
IEEE exception handling with WTL-3132 chip.


*** This section is currently under a major design. ***


Indirect addressing:
Layout of arrays.
Array references and array stores.